Journal of Molecular Evolution — Latest Matching Preprints

1

Evolutionary Stratification of Codon Usage Bias In Plants Arises from GC3 Composition and Translational Optimization

Mohanta, T. K.

2026-07-01 genomics 10.64898/2026.06.26.734692 medRxiv

Top 0.1%

9.6%

Codon usage bias is a fundamental genomic characteristic that reflects the non-random, preferential use of synonymous codons. It is a major determinant of translational efficiency, gene regulation, and molecular evolution. However, the evolutionary bias and functional relevance of codon usage bias across the plant lineage are poorly defined, and it remains unclear which factors are chiefly responsible for relative synonymous codon usage (RSCU) in genomes and how codon usage bias influences gene regulation and the molecular evolution of genomes. A genome-wide codon usage bias study of the coding DNA sequences of 262 plant genomes was conducted. It encompassed more than 4.6 billion codons from >11 million coding sequences. Relative synonymous codon usage, codon adaptation index, codon-anticodon mapping, effective number of codons (ENC)-GC3, GC1,2-GC3, parity rule 2 (PR2-bias), molecular economy, and machine learning approaches were used for the study. It was found that codon usage bias was strongly non-random and exhibited clear phylogenetic structuring. The higher plants favoured A/T-ending codons, whereas early-diverging lineages were enriched in G/C-ending codons. Analysis of RSCU, codon adaptation index, and codon-anticodon pairing indicated that translational selection is mediated by tRNA availability, which helps sustain these molecular patterns. Machine-learning approaches identified a small subset of codons having outsized influence on genome-wide codon usage landscapes. Further studies revealed the presence of robust inverse relationships between the effective number of codons and GC content at synonymous third positions. Neutrality analysis revealed that approximately 61% of variation was driven by mutational pressure, tempered by selective constraints. Phylogenetic reconstruction showed a progressive relaxation of codon bias from algae to angiosperms while maintaining a conserved molecular economy cost of ~30 ATP per codon across the lineages. The study revealed that codon usage bias is a lineage-specific, evolutionarily conserved trait governed by mutation, selection, and translational optimization.
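
The RSCU statistic named in this abstract can be illustrated with a minimal sketch. It assumes the standard genetic code and the usual definition (observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally); the function name `rscu` and the pooling of counts over all input sequences are illustrative choices, not details taken from the preprint.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """Return {codon: RSCU}, pooled over in-frame coding sequences (hypothetical helper)."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:  # skips codons containing N or other ambiguity codes
                counts[codon] += 1

    # Group sense codons by the amino acid they encode (stop codons are excluded).
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for aa, codons in families.items():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid never observed: RSCU is undefined
        expected = total / len(codons)  # count expected under equal usage
        for c in codons:
            values[c] = counts[c] / expected
    return values
```

With this definition a value of 1 means no bias, and the values within one amino-acid family sum to the number of synonymous codons. Single-codon families (Met, Trp) always score 1 and are often dropped from downstream analyses.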

2

Evo 2's Perception of Single Nucleotide Substitutions in the Genes of Two Plant Model Organisms

Mantegazza, O.; Bertolini, L.; Leoni, G.; Colaiacovo, M.; Petrillo, M.; Bonfini, L.; Savini, C.; Ceresa, M.; Zaoui, X.

2026-07-03 genomics 10.64898/2026.07.01.729829 medRxiv

Top 0.1%

2.9%

Although DNA Large Language Models (DNA-LLMs) offer a path to decoding genetic complexity, our ability to evaluate these models is constrained by our incomplete understanding of the very same genetic syntax and functional logic that these models are trained to learn. In this study we use single nucleotide substitutions that have or have not been observed in living organisms, to evaluate how the DNA-LLM Evo 2 interprets gene sequences from two plant model organisms, Arabidopsis thaliana and Oryza sativa japonica. Using perplexity as a measure of the model's confidence, we observe that alleles containing simulated substitutions are perceived, on average, as less likely than those observed in vivo. Although the size of the effect is modest, the effect is statistically significant and robust, suggesting that Evo 2 is aligned with our current understanding of evolutionary selective constraints. This approach is designed to be model-agnostic and species-agnostic and could serve as a generic framework for evaluating the performance of DNA-LLMs.
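
The perplexity comparison described in this abstract can be sketched as follows. The sketch does not call Evo 2; it assumes per-token natural-log probabilities for each allele have already been obtained from some model, and the function names `perplexity` and `perplexity_shift` are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(negative mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_shift(ref_logprobs, alt_logprobs):
    """Positive values mean the alternative allele looks less likely to the model."""
    return perplexity(alt_logprobs) - perplexity(ref_logprobs)
```

A model that assigns probability 0.25 to every nucleotide has perplexity 4, the value expected for uniform guessing over A, C, G, and T. Lower perplexity indicates higher model confidence in the sequence.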

3

Patterns of molecular conservation along tooth development are only partly shaped by evolutionary pressures on tooth

Ganofsky, J.; Estevez-Villar, M.; Mouginot, M.; Moretti, S.; Nyamari, M.; Robinson-Rechavi, M.; Pantalacci, S.; Semon, M.

2026-06-19 evolutionary biology 10.64898/2026.06.19.733320 medRxiv

Top 0.1%

2.6%

Although it is well established that certain stages of development are molecularly more conserved than others, the reasons for this phenomenon remain largely unknown. We study molecular conservation in the development of an organ, the molar, by comparing the temporal profiles of expression in mice and hamsters. We find that the cause of conservation of expression and of coding sequences changes over molar development. Gene expression levels display a classical increase of divergence as development progresses. In terms of genes expressed, the composition of early and late stages is better conserved and enriched in pleiotropic genes, yet each stage mobilizes different sets of pleiotropic genes: cell division for bud growth and secretion for tooth mineralization. Moreover, similar patterns of higher divergence of gene sets and of coding sequences at mid-development are caused by different biological phenomena, in this case heterochronies and blood colonisation, respectively. In conclusion, the patterns of molecular conservation in developing molars are shaped by a combination of processes intrinsic to the teeth, and by negative and positive selection on functions which are mostly extrinsic to the teeth. This is likely translatable to explain molecular conservation patterns in many other biological systems.

AUTHOR SUMMARY: For species to evolve different adaptations to different lifestyles, their anatomy has to evolve correspondingly. This in turn implies evolution of the embryonic development of anatomical structures. Notably, tooth shape can evolve rapidly as an adaptation to different diets. Mice and hamsters are closely related rodents that nevertheless differ in the shape of their molars, and thus in their development. In this study, we investigated why the genes active in molar development are more or less similar between the two species from early tooth bud to fully formed embryo molar. We found that early and late molar development were slow evolving, while mid-development was evolving faster. But surprisingly, this was in part due not to tooth evolution, but to the involvement of genes which are active in other processes in the body. For example, an influx of immune cells also brings fast-evolving immune genes. This helps us better understand the complexity of causes behind apparently simple evolutionary patterns.

4

Gene model for the ortholog of Lst8 in Drosophila yakuba

Lawson, M. E.; Sanow, K. A.; Chetana, K.; Taylor, E.; Morgan, A.; Flannery, D.; Elsie, C.; Rele, C. P.; Reed, L. K.; O'Rourke, K. S.

2026-05-14 genomics 10.64898/2026.05.12.723325 medRxiv

Top 0.1%

2.4%

Gene model for the ortholog of Lst8 (Lst8) in the May 2011 (WUGSC dyak_caf1/DyakCAF1) Genome Assembly (GenBank Accession: GCA_000005975.1) of Drosophila yakuba. This ortholog was characterized as part of a developing dataset to study the evolution of the Insulin/insulin-like growth factor signaling pathway (IIS) across the genus Drosophila using the Genomics Education Partnership gene annotation protocol for Course-based Undergraduate Research Experiences.

5

Identifying transcriptomic bias across developmental shifts in insects

Cornet, S.; Dennis, A. B.

2026-06-14 evolutionary biology 10.64898/2026.06.12.731678 medRxiv

Top 0.1%

2.4%

Background: Synonymous mutations, once considered neutral, can affect translation efficiency through mRNA folding and splicing, generating codon usage bias. This bias is often linked to genomic GC content, which also influences gene regulation. In the parasitoid wasp Lysiphlebus fabarum, GC content was previously shown to shift between developmental stages, with larvae showing higher GC than adults. Whether this phenomenon is widespread among insects remains unknown. Results: Transcriptomic data from six insect species spanning Diptera, Hymenoptera, and Lepidoptera were used to compare GC content between expressed genes in larvae and adults. In five species, larval transcripts exhibited higher GC content than adult transcripts. Differential expression analysis revealed that stage-biased genes displayed consistent GC shifts, and orthologous gene families with representatives across species showed particularly GC-rich larval-biased genes in Hymenoptera and Diptera. At the genome scale, modeling in 317 insect species demonstrated an association between parasitic lifestyle and reduced mean GC content in Hymenoptera and Diptera, providing a possible ecological explanation for AT-rich genomes. Conclusions: Our results show that GC content is dynamic across developmental stages, independent of overall genome composition. Stage-specific GC enrichment may reflect adaptive codon usage optimizing translation during energetically demanding life-history stages such as larval development. Furthermore, the association between parasitism and reduced genomic GC highlights how ecological lifestyle might interact with genome content and evolution. Lastly, this work identifies candidate genes underlying stage-specific GC bias and provides new insights into the interplay between molecular evolution, development, and parasitic adaptation in insects.
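
The larva-versus-adult GC comparison in this abstract reduces to a simple per-transcript statistic. The sketch below is illustrative only: it takes an unweighted mean over transcripts, whereas the preprint may weight by expression or restrict to stage-biased genes, and all function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T) in a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return float("nan")  # nothing to measure, e.g. an all-N sequence
    return (seq.count("G") + seq.count("C")) / acgt


def mean_gc(transcripts):
    """Unweighted mean GC content over a non-empty list of transcript sequences."""
    return sum(gc_content(t) for t in transcripts) / len(transcripts)


def stage_gc_difference(larval_transcripts, adult_transcripts):
    """Positive values mean larval transcripts are more GC-rich than adult ones."""
    return mean_gc(larval_transcripts) - mean_gc(adult_transcripts)
```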

6

Location dependence of protein intrinsic disorder in Drosophila melanogaster

Abdulla Daanaa, H. S.; Kuraku, S.; Akashi, H.; Saito, K.

2026-07-03 bioinformatics 10.64898/2026.07.02.732782 medRxiv

Top 0.2%

1.9%

The relevance of protein structural flexibility in function remains contested, but experimental and computational evidence continues to accumulate. Many efforts to address this investigate intrinsic disorder, which commonly refers to peptide segments or entire protein sequences that presumably lack structure and exhibit high flexibility/conformational heterogeneity under physiological conditions. These efforts face challenges such as conflicting computational predictions and ambiguous relationships among intrinsic disorder locations and other protein properties. We address these challenges at a genome-wide scale in Drosophila melanogaster using residue-level predictions for various protein properties. We employ single and consensus approaches to quantify the prevalence of intrinsic disorder and attempt to infer function by testing for differences along protein sequences. Intrinsic disorder is likely more common at terminals than internal regions, and amino acid frequencies can vary substantially between regions in a manner that plausibly reflects functions of intrinsic disorder, rather than only proteome-wide effects. Tertiary structure potentially underlies the prevalence of intrinsic disorder along protein sequences; this prevalence varies more in a putatively solvent-exposed context than a solvent-buried one. Protein-binding appears to be a main function of intrinsic disorder, and we find support consistent with the notion that structural flexibility fosters binding plasticity, and show that location and protein length are factors in this relationship. Nucleic acid-binding and linker are ostensibly less common disorder functions than protein-binding, but nucleic acid-binding seems more localized at terminals. Residue-level estimates of selection pressure indicate that disordered regions generally evolve under weaker sequence constraints than structured regions, except at the N-terminal region. 
Biases in disorder prediction are a considerable factor in many of the observations, but are unlikely to be a full explanation. The findings strengthen support for the functional relevance of flexibility, offer insight into protein architecture and function, and lend impetus for experimental inquiry.

7

In silico restriction site analysis of whole genome sequences shows patterns caused by selection and sequence duplications

Vedder, L.; Schoof, H.

2026-05-16 genomics 10.64898/2026.05.15.725336 medRxiv

Top 0.2%

1.8%

Biological sequences are known to be non-random. Thus, the comparison of in silico restriction fragment distributions of random and biological sequences may be an indicator of this non-randomness. Our analyses show that for most of the tested combinations of restriction enzyme and genome sequence, the fragments per megabase of the biological sequence deviate by more than 10% from the corresponding random sequence. This deviation goes in both directions, i.e. clearly increased values are as common as clearly decreased values. Although there is no species- or restriction-enzyme-specific effect, a clear impact of the GC content both of the restriction site and of the genome sequence can be seen. In contrast to the random sequences, the genome sequences show distinct peaks in their fragment length distributions, hinting at repetitive elements such as transposons.
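
The comparison described in this abstract can be sketched as an in silico digest. The sketch rests on several assumptions that are not taken from the preprint: the sequence is linear, the recognition site is palindromic (so only one strand is searched), and the random reference sequence is drawn with the same length and GC content as the genome. All function names are hypothetical.

```python
import random


def cut_positions(seq, site, cut_offset=0):
    """Positions where the enzyme cuts; overlapping site matches are included."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + cut_offset)
        start = seq.find(site, start + 1)
    return positions


def fragment_lengths(seq, site, cut_offset=0):
    """Lengths of the fragments produced by digesting a linear sequence."""
    cuts = [c for c in cut_positions(seq, site, cut_offset) if 0 < c < len(seq)]
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def fragments_per_mb(seq, site):
    """Number of fragments normalised to one megabase of sequence."""
    return len(fragment_lengths(seq, site)) / len(seq) * 1_000_000


def random_sequence(length, gc, rng):
    """Random sequence with expected GC fraction `gc` (assumed null model)."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # order: A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def relative_deviation(genome, site, rng):
    """(observed - expected) / expected fragments per megabase."""
    gc = (genome.count("G") + genome.count("C")) / len(genome)
    null = random_sequence(len(genome), gc, rng)
    observed = fragments_per_mb(genome, site)
    expected = fragments_per_mb(null, site)
    return (observed - expected) / expected
```

For example, `fragment_lengths("AAGAATTCTTGAATTCAA", "GAATTC", cut_offset=1)` models EcoRI cutting after the G of each site. A relative deviation beyond ±0.10 corresponds to the 10% threshold mentioned in the abstract.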

8

Gene model for the ortholog of tgo in Drosophila busckii

Perez, J.; Giunta, A. A.; Wittke-Thompson, J. K.

2026-07-01 genomics 10.64898/2026.06.26.734908 medRxiv

Top 0.2%

1.7%

Gene model for the ortholog of tango (tgo) in the Sep. 2015 (UC Berkeley ASM127793v1/DbusGB1) Genome Assembly (GenBank Accession: GCA_001277935.1) of Drosophila busckii. This ortholog was characterized as part of a developing dataset to study the evolution of the Insulin/insulin-like growth factor signaling pathway (IIS) across the genus Drosophila using the Genomics Education Partnership gene annotation protocol for Course-based Undergraduate Research Experiences.

9

Gene model for the ortholog of DENR in Drosophila eugracilis

Lawson, M. E.; Sanow, K. A.; Martinand, I.; Fratian, M.; Matura, M.; Rele, C. P.; Reed, L. K.; Thompson, J. S.; O'Rourke, K. S.

2026-06-26 genomics 10.64898/2026.06.23.734050 medRxiv

Top 0.2%

1.7%

Gene model for the ortholog of Density regulated protein (DENR) in the Apr. 2013 (BCM-HGSC/Deug_2.0) (DeugGB2) Genome Assembly (GenBank Accession: GCA_000236325.2) of D. eugracilis. This ortholog was characterized as part of a developing dataset to study the evolution of the Insulin/insulin-like growth factor signaling pathway (IIS) across the genus Drosophila using the Genomics Education Partnership gene annotation protocol for Course-based Undergraduate Research Experiences.

10

Genome quality variation across Scyphozoa and the comparative distribution of retinoid- and AhR-related gene families.

Park, Y.-J.; Lee, N.; Jo, Y.; Yum, S.; Kwon, K. K.

2026-04-23 evolutionary biology 10.64898/2026.04.22.720242 medRxiv

Top 0.2%

1.7%

Scyphozoan jellyfish have a complex life cycle that includes a characteristic transition known as strobilation. Retinoid signaling has been suggested to be involved in jellyfish metamorphosis and development. However, the genomic basis of signaling pathways associated with metamorphosis has not been sufficiently compared at the class level. Experimental studies have reported that indole compounds can induce metamorphosis in some jellyfish species. Indole- and tryptophan-derived metabolites are known to function as ligands for the aryl hydrocarbon receptor (AhR) in other organisms. However, the potential role of AhR signaling in jellyfish metamorphosis has not been previously explored. We compared the distribution of retinoid- and AhR-associated gene families across multiple scyphozoan genomes. This analysis aimed to characterize their distribution patterns in relation to signaling pathways associated with development and environmental responses. A standard gene prediction and annotation pipeline was applied to 20 species from 21 publicly available scyphozoan reference genome assemblies retrieved from the NCBI database. The distribution and copy number of these gene families were compared across species. Retinoid-associated gene families were detected across almost all Scyphozoa genomes, and core components of AhR signaling (AhR, ARNT) were identified in most species. These results suggest that scyphozoan genomes contain genetic components of retinoid- and AhR-related signals. This study presents the distribution of gene families related to developmental signaling across Scyphozoa using a comparative genomic approach. It does not imply direct functional involvement of retinoid or AhR signaling, but instead focuses on potential signaling pathways at the genome level. It also provides an overview of currently available scyphozoan genomic data. These findings provide a basis for future hypothesis generation and functional validation in jellyfish metamorphosis research.

11

Comparison of directional random walk and weighted least squares modeling of sparse fossil data

Ergon, R.

2026-07-01 evolutionary biology 10.64898/2026.06.26.734751 medRxiv

Top 0.2%

1.7%

The general random walk model (GRW) of Hunt (2006) is used to infer directional evolution in mean trait values from sparse fossil data by modeling phenotypic change as the accumulated result of small steps with mean step sizes and step variances. Using simulations and real data cases, Ergon (2026) showed that the step variances can be estimated reasonably well only when the mean trait values have small measurement errors, while for fossil data with realistic measurement errors they appear to be extremely difficult to find, and they are often found to be negative. In the simulations Ergon (2026) assumed that the true phenotypic mean values were known. Here, I essentially repeat these simulations under the assumption that only mean trait values with large measurement errors are known, and based on weighted mean squared error (WMSE) comparisons the conclusion is that weighted least squares (WLS) is a better method than GRW. A second conclusion is that WLS is a better method also in the possibly rare cases with large measurement errors where the GRW parameters are estimated well. The GRW method is simply not flexible enough to handle such cases. A third conclusion is that Akaike Information Criterion (AIC) results for GRW models with large measurement errors relative to the step variance may be overly optimistic.

12

Gene model for the ortholog of raptor in Drosophila erecta

Backlund, A. E.; Nielsen, J.; Pulford, J.; Cook, B.; Anderson, J.; Robert, M.; Thompson, J. S.; Rele, C. P.; Wittke-Thompson, J. K.

2026-07-14 genomics 10.64898/2026.07.09.737526 medRxiv

Top 0.2%

1.5%

Gene model for the ortholog of raptor in the May 2011 (Agencourt Dere_CAF1/DereCAF1) Genome Assembly (GenBank Accession: GCA_000005135.1) of Drosophila erecta. This ortholog was characterized as part of a developing dataset to study the evolution of the Insulin/insulin-like growth factor signaling pathway (IIS) across the genus Drosophila using the Genomics Education Partnership gene annotation protocol for Course-based Undergraduate Research Experiences.

13

Generative continuous time model reveals epistatic signatures in protein evolution

Pagnani, A.; Barrat-Charlaix, P.

2026-07-10 bioinformatics 10.1101/2025.09.17.676821 medRxiv

Top 0.3%

1.3%

Protein evolution is fundamentally shaped by epistasis, where the effect of a mutation depends on the sequence context. As standard phylogenetic methods assume independently evolving sites, there is a need for more complex models based on accurate estimations of the fitness landscape. Good candidates are modern generative models -- such as the Potts model -- which successfully capture epistatic effects. However, recent work on generative evolutionary models usually uses discrete time, making these models difficult to integrate with the standard frameworks in evolutionary biology. We introduce a continuous-time sequence evolution model using the Gillespie algorithm and parameterized by a generative Potts model. This approach enables us to simulate realistic, family-specific evolutionary trajectories and allows for direct comparison with independent-site models. Surprisingly, we find that while epistasis significantly slows down evolution, it does not change the average evolutionary rates at individual sites. This is explained by the rate heterogeneity caused by context-dependence: we show that the rate at some positions varies from null to high values depending on the context, while other positions are essentially independent of the context. Finally, we show that epistasis leads to a systematic underestimation bias in the inference of evolutionary distance between sequences. Overall, our work provides a new tool for simulating realistic protein evolution and offers novel insights into the complex interplay between epistasis and evolutionary dynamics.
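
The combination of a Potts energy with the Gillespie algorithm can be illustrated with a toy sketch. Everything specific here is an assumption rather than the authors' model: the four-letter alphabet, the sparse dictionaries used for fields and couplings, and in particular the rate rule exp(-ΔE/2), which is chosen only because it satisfies detailed balance with respect to P(s) ∝ exp(-E(s)).

```python
import math
import random

ALPHABET = "ACDE"  # toy alphabet (assumption), not the 20 amino acids


def energy(seq, fields, couplings):
    """Potts energy E = -sum_i h_i(a_i) - sum_{i<j} J_ij(a_i, a_j); missing entries are 0."""
    e = 0.0
    for i, a in enumerate(seq):
        e -= fields[i].get(a, 0.0)
        for j in range(i + 1, len(seq)):
            e -= couplings.get((i, j, a, seq[j]), 0.0)
    return e


def gillespie(seq, fields, couplings, t_max, rng):
    """Simulate single-site substitutions in continuous time up to t_max."""
    seq = list(seq)
    t = 0.0
    trajectory = [(t, "".join(seq))]
    while True:
        e0 = energy(seq, fields, couplings)
        moves, rates = [], []
        for i, old in enumerate(seq):
            for new in ALPHABET:
                if new == old:
                    continue
                seq[i] = new
                delta = energy(seq, fields, couplings) - e0
                rates.append(math.exp(-delta / 2))  # assumed rate rule
                moves.append((i, new))
            seq[i] = old  # restore the site before trying the next one
        total = sum(rates)
        t += rng.expovariate(total)  # waiting time to the next substitution
        if t > t_max:
            break
        i, new = rng.choices(moves, weights=rates, k=1)[0]  # pick event by rate
        seq[i] = new
        trajectory.append((t, "".join(seq)))
    return trajectory
```

Passing an empty `couplings` dictionary gives an independent-site model under the same rate rule, which is the kind of baseline the abstract compares against. Because the rate at a site depends on the current letters at coupled sites, the same position can be nearly frozen in one context and fast in another.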

14

Expression-dependent but strand-independent synonymous single-nucleotide polymorphism in the Escherichia coli chromosome

Deka, N.; Beura, P. K.; Sen, P.; Aziz, R.; Kashyap, A.; Keot, D.; Jain, M.; Namsa, N. D.; Deka, R. C.; Feil, E.; Satapathy, S. S.; Ray, S. K.

2026-05-26 evolutionary biology 10.64898/2026.05.22.727198 medRxiv

Top 0.3%

1.1%

Background: Mutation is thought to arise mainly during replication, though transcription is also known to be mutagenic. Despite recent reports of genome-wide transcription-induced mutagenesis, a distinct demonstration of specific mutations being replication-dependent and/or transcription-dependent in genomes is yet to be established. Here, we studied synonymous single-nucleotide polymorphisms (SNPs) in 2091 individual coding sequences (CDS) in the leading strand (LeS) and the lagging strand (LaS) of the Escherichia coli chromosome by comparing across 157 strains. The frequencies of complementary transitions (ti) and complementary transversions (tv) were compared in each CDS to assess parity violation in the strands. Results: The C→T and G→A transitions exhibited the maximum frequency as well as the most prominent strand inequality, as these transitions were influenced both by strand and by expression. Interestingly, inequality between T→C and A→G was expression-dependent but strand-independent. A→T and G→T transversions were universally more frequent than their complementary T→A and C→A transversions, respectively. Conclusions: Our study demonstrates strand-independent but expression-dependent synonymous SNP inequality in CDS, supporting a role for transcription-induced mutagenesis in contributing to strand inequality in the E. coli chromosome.
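
The parity comparison between complementary substitutions in this abstract can be sketched as a ratio of counts. The sketch assumes each SNP has already been polarised into a (reference base, derived base) pair on the coding strand, which is an input assumption rather than a description of the authors' pipeline; the function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary(change):
    """Complementary substitution, e.g. C->T pairs with G->A."""
    ref, alt = change
    return (COMPLEMENT[ref], COMPLEMENT[alt])


def parity_ratios(changes):
    """Ratio of each substitution count to the count of its complement.

    Under intra-strand parity the ratio is expected to be close to 1.
    Substitutions whose complement was never observed are omitted.
    """
    counts = Counter(changes)
    ratios = {}
    for change in counts:
        comp = complementary(change)
        if counts[comp]:  # a Counter returns 0 for unseen keys without inserting them
            ratios[change] = counts[change] / counts[comp]
    return ratios
```

Computing these ratios separately for leading-strand and lagging-strand CDS, and for high- and low-expression genes, would separate the strand and expression effects the abstract describes.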

15

Mutational and bioinformatic analysis of the binding site for the ribonucleotide reductase-specific transcriptional repressor NrdR

Shahid, S.; Lundin, D.; Rozman Grinberg, I.; Sjöberg, B.-M.

2026-05-14 molecular biology 10.64898/2026.05.11.724285 medRxiv

Top 0.3%

1.1%

The prevalent transcriptional repressor NrdR binds to highly conserved prokaryotic sequences in the promoter regions of operons encoding the essential enzyme ribonucleotide reductase. The NrdR binding sites consist of two partially palindromic 16 bp sequences (NrdR boxes) separated by a 15-16 bp linker sequence. We have assessed the requirement of both boxes for binding and the propensity of different NrdRs to bind to heterologous binding sites, and found that the linker is constrained only in length, not in sequence. As we have observed several deviations from the conserved sequences of the NrdR boxes, we here test the conservation requirements of individual base pairs in the NrdR boxes using a synthetic DNA fragment (Synt DNA) to which the NrdR proteins from the actinomycete Streptomyces coelicolor and the gammaproteobacterium Escherichia coli bind as well as they do to their homologous binding sites. By introducing isolated mutations into Synt DNA and testing the binding capacity of NrdR from S. coelicolor and E. coli, we expand our understanding of the criteria needed to build a functional binding site for the NrdR repressor.

16

Gene model for the ortholog of raptor in Drosophila grimshawi

Lieser, B. C.; Lose, B.; Kiser, C. A.; Butterfield, S.; Laschober, L.; Laskowski, L. F.; Nielsen, J.; Pulford, J.; Thompson, J. S.; Rele, C. P.; Wittke-Thompson, J. K.

2026-07-11 genomics 10.64898/2026.07.07.737051 medRxiv

Top 0.3%

1.1%

Gene model for the ortholog of raptor in the D. grimshawi May 2011 (Agencourt dgri_caf1/DgriCAF1) Genome Assembly (GenBank Accession: GCA_000005155.1) of Drosophila grimshawi. This ortholog was characterized as part of a developing dataset to study the evolution of the Insulin/insulin-like growth factor signaling pathway (IIS) across the genus Drosophila using the Genomics Education Partnership gene annotation protocol for Course-based Undergraduate Research Experiences.

17

Structural distance at the tRNA synthetase active site interface predicts pathogenicity but is captured by AlphaMissense and EVE except among score-ambiguous variants

Liebeskind, K.; Francklyn, C.; Barrantes Reynolds, R.

2026-05-26 bioinformatics 10.64898/2026.05.22.727252 medRxiv

Top 0.3%

1.1%

Variants of uncertain significance have accumulated as genomic sequencing has become more widespread, which complicates rare disease diagnosis and requires substantial resources for re-evaluation. Aminoacyl-tRNA synthetases (ARSs) are a protein family with extensive variant data and well-characterized disease associations, making them an ideal system for investigating the relationship between variant location and pathogenicity. Using structural distance measurements to the ARS-tRNA binding interface combined with existing pathogenicity predictors, AlphaMissense and EVE, we investigated whether explicit structural binding information could improve missense variant pathogenicity prediction. Pathogenic variants were found to cluster significantly closer to the tRNA-binding interface than benign variants (p = 0.0003). Incorporating explicit distance information into a Bayesian mixture model did not substantially improve predictive performance over AlphaMissense and EVE alone, suggesting that these models already implicitly capture relevant structural binding context. However, a clinically important subset of interface variants classified as ambiguous by both existing models identifies a specific gap where explicit structural distance information may provide added discriminative value, but the limited number of clinically validated variants currently available constrains the ability to fully evaluate this potential. Incorporating additional biologically relevant features not captured by existing models, such as protein stability or conformational dynamics, as well as refining structural distance calculations, may further improve classification of this subset. These findings highlight both the power and the limitations of existing pathogenicity predictors and suggest that structurally informed approaches targeting the binding interface represent a promising direction for improving classification of these ambiguous variants that have great clinical significance. 
Author Summary: Advances in clinical genetic sequencing have led to the identification of a growing number of genetic variants whose impact on human health is unknown. These "variants of uncertain significance" present a major challenge because their role in causing disease cannot yet be confirmed or ruled out. This study focuses on a specific family of essential enzymes called aminoacyl-tRNA synthetases, which play a critical role in protein translation. Mutations in these enzymes have been linked to a range of diseases. This project aims to provide a novel method for determining the pathogenicity of variants specifically in aminoacyl-tRNA synthetases. We propose that the physical proximity of a variant to the functional binding site of the enzyme is influential in determining pathogenicity. We find that this spatial relationship is a meaningful indicator of a variant's potential to disrupt normal function.

18

Secondary structure distances reveal a new dimension of protein evolution

Bastida, A.; Muñoz Morales, A. M.; Egea-Cortines, M.

2026-05-01 evolutionary biology 10.64898/2026.04.29.721599 medRxiv

Top 0.3%

1.1%

Molecular phylogenetics based on primary sequence comparisons has been central to reconstructing protein evolution. However, structural evolution does not necessarily parallel sequence divergence, particularly in proteins combining ordered domains with intrinsically disordered regions (IDRs). Here, we introduce a quantitative secondary structure distance (S2D) metric that enables systematic comparison of protein secondary structure, including both ordered elements and IDRs. Using the MADS-box transcription factor family as a model, we show that structural divergence is domain-specific and only partially coupled to sequence-based phylogeny. Domain-resolved analyses reveal that the DNA-binding M domain remains structurally constrained, whereas the I and C domains exhibit extensive sequence divergence while retaining conserved intrinsic disorder. In contrast, the K domain contributes disproportionately to global structural variability. Integrating S2D with phylogenetic distance uncovers both convergent structural architectures among distantly related proteins and pronounced structural remodelling within closely related paralogs--patterns not evident from primary sequence comparisons alone. Residue-level analyses further demonstrate that the structural impact of mutation depends strongly on amino acid identity and does not scale directly with substitution frequency or conservation metrics. Together, these findings indicate that secondary structural evolution provides an additional dimension of protein diversification beyond sequence divergence. By integrating phylogenetic and structural distances, this framework offers a complementary approach to interpreting protein evolution, particularly in families containing mixtures of ordered domains and intrinsically disordered regions. Significance Statement: Evolutionary relationships are typically inferred from primary sequence comparisons, yet structural evolution may follow different trajectories. 
By developing a quantitative measure of secondary structural divergence, we show that structural change within the MADS-box transcription factor family can both converge and diverge independently of sequence-based phylogeny. Intrinsically disordered regions exhibit extensive sequence divergence while retaining conserved disorder, whereas specific amino acid substitutions disproportionately reshape secondary structure. These findings demonstrate that evolutionary diversification operates through domain-specific structural modulation rather than uniform sequence divergence. Integrating structural and phylogenetic distances provides a complementary framework for interpreting protein evolution and reveals evolutionary patterns that remain hidden when relying on sequence comparisons alone.

19

Testing the reliability of AI-generated protein structures

Xu, A.; Salzberg, S.

2026-06-13 bioinformatics 10.64898/2026.06.11.731682 medRxiv

Top 0.3%

1.0%

Although AlphaFold2 and its competitors have demonstrated remarkable abilities to predict protein structure, more work is needed to explore the limitations of these methods. Here we investigated the reliability of AlphaFold2 and ColabFold by creating a set of realistic but false protein sequences, using ColabFold to predict their structure, and then asking how often the program produces a high-scoring structure for a sequence that does not represent a protein. We determined that AlphaFold2 has a very small but non-zero false positive rate, estimated here at approximately 1 in 435 if one uses a threshold pLDDT score of 70 to define positive predictions. We also discovered, serendipitously, that some high-scoring sequences in the human genome were not false positives, but instead were previously unknown and unannotated pseudogenes. These latter findings indicate that some well-established human annotations of protein-coding genes may have incorrectly extended the 5' untranslated regions too far. They also suggest that AlphaFold2's false positive rate is low enough that almost any high-scoring structure, even in a noncoding region, is worthy of further investigation.

20

Multiple molecular and cellular properties jointly affect protein and site-specific evolutionary rates

Saini, A.; Usmanova, D. R.; Supo Escalante, R.; Vitkup, D.

2026-05-23 evolutionary biology 10.64898/2026.05.20.726710 medRxiv

Top 0.3%

1.0%

Protein evolutionary rates vary widely across proteins and among sites within proteins, reflecting multiple molecular, cellular, and functional constraints. While protein-level properties, such as expression and essentiality, and site-level structural and functional constraints, are known to influence evolutionary rates, how these constraints combine across scales to determine site-specific evolutionary rates remains unclear. Moreover, because many protein features are strongly correlated, it is difficult to disentangle their individual contributions to evolutionary rate variance, and unified predictive models that integrate these properties are still lacking. Here, we use neural networks to predict protein evolutionary rates across multiple scales based on multiple molecular and cellular features. At the protein level, integrating molecular and cellular descriptors explains substantial variance in evolutionary rates across proteins in multiple eukaryotic species, including nearly 50% of the variance in humans and substantial fractions of the variance in other eukaryotic species. The model also allows us to identify proteins whose evolutionary rates deviate from expectations based on their molecular and cellular properties. At the site level, we found that structural and functional features explain a comparable fraction of the variance in relative evolutionary rates. By integrating protein-level and site-level predictors, the model explains up to 37% of the variance in site-specific evolutionary rates across proteins. Our analysis demonstrates that constraints at these two scales combine largely additively, with protein-level properties setting the overall evolutionary context and site-level properties shaping variation within proteins. Together, these results provide a quantitative framework for understanding protein evolution across biological scales.